Excel is a general spreadsheet tool, with some statistical capability
Excel may need care when importing data, as it does not have statistical datatypes, and converts data to new forms silently.Minitab is a general-purpose statistical package, with an intuitive graphical interfaceR is a general-purpose programming language with an emphasis on statistics and data scienceMinitab and R have modes of operation that enforce good data practice; Excel does notR has an extremely wide range of libraries and packages for specialised and advanced statistical analysis, which are not available in either Minitab or ExcelTo participate in this workshop or to work through the example in your own time, you will need the following on your own computer:
OpenRefine dataset from the “Cleaning Data With OpenRefine” workshop: esoph-tab.tsv (or whatever you named it)Excel (you have a licence for this via the university)Minitab (you have a licence for this via the university)R and RStudioPlease see the configuration notebook for help and guidance on setting up R and RStudio.
To download and install the latest versions of Excel and Minitab please visit the university’s IT pages.
Data can be collected, stored, and analysed in a variety of ways, using a wide range of tools. We do not prescribe any particular approach, other than to note best practices for FAIR and op science prefer open, plain-text, human-readable non-proprietary data formats that are shared in public repositories. The best tool for data analysis for your work may differ depending on your area, and from project to project. Most good tools will allow you to read and write data in exchangeable formats compatible with FAIR and open science. In this workshop we will explore three common tools: Excel, Minitab and R, and use these to open and analyse our cleaned dataset esoph-tab.tsv from the earlier workshop.
We will perform three main tasks in each package:
ExcelMicrosoft Excel is almost ubiquitous. It is now available free of charge as a cloud service, as part of Office3651. Alternatives to Excel, such as Google Sheets2 and Libre Office3, work in much the same way. This workshop should be adaptable to those alternatives with minimal modification.
ExcelOpenesoph-tab.tsv, select it and click OpenFor any format other than its own, proprietary .xls or .xlsx format, Excel will start the Text Import Wizard.
This makes loading the simplest datasets seem more complicated than it really is, and serves to nudge users towards Microsoft’s own format, and away from good data practice.
The Text Import Wizard should correctly identify that your data are delimited4.
However, Excel may not notice that the first row is a header row5, and may treat all rows equivalently.
Next >On the next page of the wizard, Excel indicates that it has identified the tab (\t) character as the column delimiter. You would be able to select a different character if this was incorrect. Here, no changes are necessary.
Next >In the next wizard page, Excel allows you to select each column’s data format. These formats are: General, Text, Date, and “skip column”.
Excel does not allow you here to specify data formats or data types useful for statistical or data analysis.
In particular, it does not distinguish between categorical and numerical data. It does not allow you to specify whether numerical values should be counts (whole numbers) or “real” values (can take any decimal value).
Importing data into Excel has very limited data validation.
FinishExcel now presents your data in spreadsheet format.
Here, Excel again attempts to nudge us towards using one of the proprietary Excel formats by claiming that there is **Possible Data Loss:** Some features might be lost if you save this workbook in the comma-delimited (.csv) format. To preserve these features, save it in Excel format.
This is misleading. Our data should “safe” so long as we keep it in plain text, open, non-proprietary formats.
Excel were able to distinguish header rows from data rows?Excel import our data correctly?Excel did not import our data correctly.
Excel interpreted the range 10-19 as a date: October 19th. Our tobgp column now has a number of dates interspersed among the values. This is bad: Excel has changed our data without asking or notifying us that it would do so.
To avoid this problem, we must specify Text as the data format for our columns:
ExcelOpenesoph-tab.tsv, select it and click OpenNext >Next >TextFinishExcel has now imported our data, and it looks to be correct.
Excel import our data correctly?To obtain summary statistics for each column we need to use Excel’s functions and pivot tables, as we need to treat categorical data differently from numerical data. We want to know, for instance, the number of datapoints we have in each agegp category, rather than a “mean” of the categories (which has no real meaning); but for our cases and controls, we might still want to know means, standard deviations, etc.
In all cases, we are potentially modifying our cleaned dataset directly because we are making changes to the same file we loaded - this would be bad practice.
Our first step should be to save a new spreadsheet to do our analysis in
Excel workbook (note: this is necessary to avoid losing Excel’s calculations and formatting)We will calculate mean and standard deviation values for columns ncases and ncontrols. To do so, we need to use Excel’s AVERAGE and STDEV.S functions.
ncases values run from cell D2 to cell D89ncontrols values run from cell E2 to cell E89D90 enter the text =AVERAGE(D2:D89) and hit the return key. This will calculate the mean of quantitative data in the column of cells extending from D2 to D89, and populateExcel calculate the average value? If so, what is it?Excel did not calculate the mean value.
When we selected Text as the input data format, Excel used this to do two things:
To rectify this problem, we need to change those two columns to General format:
D and E (D2:E89)warning icon (a small yellow triangle with an exclamation mark)Convert to NumberD90 represents, add a label to cell C90 with the text Mean:.D90 into cell E90Std Dev: to cell C91=STDEV(D2:D89) into cell D91D91 into cell E91Now we have some summary information for our quantitative data.
It is often useful to know how many datapoints you have in each category. Here, we could independently sum each of the ncases or ncontrols cells, depending on which mixture of categorical variables we wanted (e.g. combine all cells for rows where tobgp is equal to 10-19), but Excel provides a tool called a pivot table to make things easier for us.
For instance, suppose we wanted to know the numbers of cases and controls in each category of the alcgp column. We could insert a pivot table that combined all values from ncases and ncontrols, for each of the individual categories in alcgp. To do this:
A2:E89)Insert \(\rightarrow\) Pivot Table \(\rightarrow\) OKThis brings up an interface that might at first look confusing. We are going to try to create a table that has one row per category in alcgp, and holds three columns:
alcgp categoryncases for that categoryncontrols for that categoryTo do so:
alcgp - this will place it in the Rows box, and you will see a view of the new pivot tablencases - this will place it in the Values box as Sum of ncases and you will see the pivot table changencontrols - this will place it in the Values box as Sum of ncontrols and you will see the pivot table change120+ categorySuppose we want to compare the numbers of cases and controls for each of our alcgp categories, we can reasonably use a bar graph (though we should really use a 1D scatterplot). To do so:
A3:C7 in the figure)Insert \(\rightarrow\) Pivot ChartThis will create a bar chart in the current worksheet, showing the number of controls and cases in each category of alcgp.
MinitabMinitab is a statistical software package, often used for teaching. It is moving towards a cloud-hosted model, but the Minitab Express package is still available for desktop use. This software requires a licence, which you will need to obtain from the university before you can run the tool.
Also, because of licensing restrictions, you must be connected to the university VPN so that the licensing server can be reached. Otherwise, you will not be able to use Minitab.
Minitab ExpressFile \(\rightarrow\) Open Worksheetesoph-tab.csv, select it and click OpenMinitab Express will not import tab-separated value files.
Minitab Express presents your data in a panelled window. By default, the data is presented in spreadsheet form in the lower panel. The top panel will contain output, and the left panel will be a Navigator. This layout reflects Minitab’s role as a statistical analysis package, rather than a general spreadsheet tool.
Minitab Express did not ask any questions about data type or formats.
To see how Minitab Express imported your data:
Inspector button (top right)A new window appears. Click on the Column icon (third icon).
Minitab Express correctly distinguished between the header and data rows?Minitab Express think each column is?Minitab Express “know about”?Excel?As a statistical package, Minitab Express is written to make obtaining summary statistical data relatively straightforward. Here we will try to recover the same data as we did in Excel.
To obtain summary statistics for a single data column in Minitab Express:
Summary \(\rightarrow\) Column StatisticsC4 ncases and click on the arrow next to VariableOKMinitab then updates the spreadsheet with calculated values and presents some output in the top panel, describing what values were calculated, and which cells in the spreadsheet hold output values.
To generate a set of descriptive statistics for a data column:
Summary \(\rightarrow\) Descriptive StatisticsC5 ncontrols and click on the arrow next to VariableOKMinitab doesn’t put these values in the spreadsheet, but it does show a nicely-formatted table with a number of descriptive statistics. You can switch between views on the data by clicking on previous steps in the Navigator (on the left, by default).
Minitab give you the same information/values as Excel for this data?Minitab give you any useful extra data?We will again try to obtain case and control counts, conditioned on the alcgp category.
Graphs \(\rightarrow\) Bar ChartMean or other function of a continuous variableMultiple Y variables \(\rightarrow\) Clusteredncases and ncontrols as the continuous variablesalcgp as the categorical variableSum as the functionMinitab, unlike Excel, shows both the tabular summary and the bar graph visualisation in the upper panel.
Excel or Minitab? Why do you think that was?RR is a general programming language, designed to be used by statisticians and data scientists. Unlike Excel and Minitab it is free and open-source. RStudio is a free, open-source integrated development environment (IDE) designed for use with R. It has powerful version control and project management features, and a very large number of third-party libraries are available for statistical, bioinformatics, and other activities. It is the de facto language of choice for much statistical and bioinformatics work in academia.
RStudioFiles panel, navigate to the location of the esoph-tab.tsv fileesoph-tab.tsv fileClicking on the esoph-tab.tsv file shows the contents of the file in a navigation window (top left by default), but does not load or import the dataset
Working in R/RStudio is unlike working in a point-and-click environment such as Excel or Minitab. Commands are issued by typing them in the Console window (bottom left, by default) or by writing and running scripts.
Today, we will write and run a simple script
Before we load the data, we need to make sure we are in the correct working directory, and create an empty script.
Files panel, click MoreSet as Working DirectoryIn the Console window, you should see that a command has been run.
To create an empty script:
File \(\rightarrow\) New File \(\rightarrow\) R ScriptThis will start a new file in the top left Editor panel. It will be called Untitled1.
To load the data from esoph-tab.tsv we need to use the command read.table(). Type the text below into the empty panel Untitled1:
# Load Oesophagus case/control data
# First row is a header
data = read.table("esoph-tab.tsv", header=TRUE, sep="\t")
Then click Run. This will bring up a “Save File” dialogue. Save the file as summarise_esoph.R
You can get help on any R command by typing its name into the search field of the Help panel (bottom left, by default)
When script commands are run in RStudio, you can see them being executed in the Console panel.
Here, it’s worth breaking down what you’ve written in the script. The first two lines start with the “hash”/“pound”/octothorpe character. This means that the line is a comment and not treated like a command. This is one way in which we annotate and document our code, so it is understandable when we return to it in future (and so others can understand it).
The data is loaded on line three: data = read.table("esoph-tab.tsv", header=TRUE, sep="\t", stringsAsFactors=TRUE). This does the following:
read.table()) the file esoph-tab.tsv
R that the first line of the file/first row of the table is a header (header=TRUE)R that the separator/delimiter is a tab character (sep="\t")R that the columns containing strings are categorical variables (stringsAsFactors=TRUE)When the data is loaded, the Environment panel (top right, by default) shows that data exists, with 88 observations of 5 variables.
data: this shows a summary of the datasetdata: this shows the data values in spreadsheet format in the Editor panel
The data variable contains an object called a dataframe8 in R jargon.
R read each column of the dataset in as?Excel or Minitab?To get basic summary statistics from a dataframe, we use the summary() command in R. Add the following lines to your script, select that line, and then Run the script:
# Obtain a summary of each column in the loaded data
summary(data)
This generates output in the Console panel:
> summary(data)
agegp alcgp tobgp ncases ncontrols
25-34:15 0-39 :23 0-9 :24 Min. : 0.000 Min. : 1.00
35-44:15 120+ :21 10-19:24 1st Qu.: 0.000 1st Qu.: 3.00
45-54:16 40-79 :23 20-29:20 Median : 1.000 Median : 6.00
55-64:16 80-119:21 30+ :20 Mean : 2.273 Mean :11.08
65-74:15 3rd Qu.: 4.000 3rd Qu.:14.00
75+ :11 Max. :17.000 Max. :60.00
Minitab?So far, the output does not include standard deviations for ncases and ncontrols. We can get this information by using the command sd() in our script, and selecting each column separately. To indicate a column, we use the variable name, followed by the $ symbol, and then the column name. So, to obtain standard deviations for our two columns in the data dataframe, we add the following lines to our script:
# Calculate standard deviations of the number of cases and controls
sd(data$ncases)
sd(data$ncontrols)
and running the script then gives us the result:
> # Calculate standard deviations of the number of cases and controls
> sd(data$ncases)
[1] 2.753169
> sd(data$ncontrols)
[1] 12.7227
Google Sheets is a cloud-based spreadsheet package, free for personal use↩︎
Libre Office is a free, open office quite intended as a drop-in, free replacement for Microsoft Office↩︎
delimited: separated by a character that indicates a new column is starting. This is often a tab (\t) or comma (,)↩︎
header row: a row in a table that indicates the contents of each column↩︎
In Excel, numbers are right-aligned, text is left-aligned. Each cell also has a green triangle in its upper left corner to indicate that it is a text value, not numerical↩︎
You can think of a variable as a box with a label on it. The label is the variable name (here, data), and the contents of the box are the data that was loaded. When you want to talk about the box or its contents, you can do so with the variable name.↩︎
The term dataframe refers to data stored according to good data practices: each row is an observation; each column is a variable↩︎